Poster: Moilleadóir Date: Apr 28, 2021 10:19pm
Forum: texts Subject: OCR Gaelic script

I’ve prepared a new PDF of an old book so that you can search the text, which makes it much more useful. I’d like to do this for more books, but I wonder whether there’s a way either to disable OCR or to get Gaelic script added to Tesseract. I was going to suggest it on GitLab, but registration there isn’t open.

All of the books on Archive.org in the old script have OCR files that are complete gibberish. It’s annoying that even when I try to fix things (by typing out the whole book), my own efforts also get OCR’d and more gibberish is added.

I also wonder whether there would be a way to use the PDF’s existing text layer instead of OCR.